```mermaid
graph LR
A["Original SWE-bench<br/>2,294 tasks<br/>Some infeasible"] --> B["Human Annotation<br/>93 software developers<br/>screened 1,699 samples"]
B --> C["SWE-bench Verified<br/>500 validated tasks<br/>Clear & solvable"]
C --> D["Reliable measure<br/>of AI coding<br/>capability"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
SWE-bench Verified
A human-validated benchmark of 500 real-world GitHub issues testing whether AI can autonomously resolve software engineering problems
Keywords: SWE-bench Verified, software engineering benchmark, AI coding evaluation, GitHub issue resolution, autonomous coding agents, Princeton NLP, OpenAI Preparedness, patch generation, real-world bugs, LLM leaderboard, mini-SWE-agent, coding agent benchmark

Introduction
Writing code that passes toy programming puzzles is one thing. Fixing real bugs in production-grade open-source repositories — where you must navigate thousands of files, understand project conventions, and produce a patch that passes hidden tests — is an entirely different challenge.
SWE-bench Verified is a human-validated benchmark of 500 real-world software engineering problems drawn from GitHub issues across 12 popular Python repositories. Each task gives an AI agent access to a full codebase and an issue description, then asks it to generate a patch that resolves the problem. The patch is evaluated against hidden unit tests that the agent never sees.
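Concretely, each task is a single dataset record. The sketch below mirrors the field names used by the HuggingFace dataset (`repo`, `instance_id`, `base_commit`, `problem_statement`, `FAIL_TO_PASS`, `PASS_TO_PASS`); the values are illustrative, not a real instance:

```python
# Illustrative SWE-bench Verified task record (field names follow the
# HuggingFace dataset; the values here are made up for demonstration).
task = {
    "repo": "django/django",                # repository the issue comes from
    "instance_id": "django__django-11099",  # unique task identifier
    "base_commit": "abc123",                # commit the agent's checkout starts from
    "problem_statement": "Regex in username validator rejects trailing newline ...",
    "FAIL_TO_PASS": '["test_username_validator"]',  # hidden: must go red -> green
    "PASS_TO_PASS": '["test_existing_behavior"]',   # hidden: must stay green
}

# The agent only sees the codebase at base_commit plus the issue text;
# the test fields are withheld until evaluation.
visible_to_agent = {k: task[k] for k in ("repo", "base_commit", "problem_statement")}
```

The split between `visible_to_agent` and the withheld test fields is what makes the benchmark a test of genuine bug-fixing rather than test-reading.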
“SWE-bench Verified is a subset of the original test set from SWE-bench, consisting of 500 samples verified to be non-problematic by our human annotators. This version supersedes the original SWE-bench and SWE-bench Lite test sets.” — OpenAI, Introducing SWE-bench Verified
What Is SWE-bench Verified?
The original SWE-bench (ICLR 2024 Oral) was a breakthrough: 2,294 software engineering problems sourced from real GitHub issues and pull requests across 12 Python repositories (Django, scikit-learn, sympy, matplotlib, Flask, etc.). But the original benchmark had issues — 68.3% of samples were flagged by human reviewers for underspecified problem statements, unfair unit tests, or setup problems that could reject correct solutions.
SWE-bench Verified fixes this. OpenAI’s Preparedness team collaborated with the SWE-bench authors to have 93 professional software developers manually review each sample. Only tasks with clear problem descriptions, fair evaluation tests, and reliable environments were kept — resulting in a curated set of 500 high-quality tasks.
How It Works
| Step | Description |
|---|---|
| Input | Agent receives a full codebase + GitHub issue description |
| Task | Generate a patch (code edit) that resolves the described issue |
| Evaluation | Hidden FAIL_TO_PASS tests must pass (issue resolved) AND PASS_TO_PASS tests must pass (nothing broken) |
| Metric | % Resolved — percentage of the 500 tasks fully solved |
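The pass/fail rule in the table above can be sketched as a single predicate (a minimal sketch; the real harness runs the test suites inside Docker, but the resolution criterion reduces to this conjunction):

```python
def is_resolved(fail_to_pass_results, pass_to_pass_results):
    """A patch resolves a task only if every FAIL_TO_PASS test now passes
    (the issue is fixed) AND every PASS_TO_PASS test still passes (no
    regressions). Both arguments map test name -> bool (True = passed)."""
    return all(fail_to_pass_results.values()) and all(pass_to_pass_results.values())

# A fix that solves the issue but breaks an existing test does not count:
print(is_resolved({"test_bug": True}, {"test_other": False}))  # False
```

The second condition is what separates SWE-bench from simpler code-generation benchmarks: a patch must fix the bug without collateral damage.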
```mermaid
graph TD
A["GitHub Issue<br/>Description"] --> B["AI Agent"]
C["Full Codebase<br/>(thousands of files)"] --> B
B --> D["Generated Patch<br/>(code changes)"]
D --> E{"Hidden Tests"}
E -->|"FAIL_TO_PASS ✅<br/>PASS_TO_PASS ✅"| F["✅ Resolved"]
E -->|"Any test fails"| G["❌ Not Resolved"]
style A fill:#3498db,stroke:#333,color:#fff
style C fill:#3498db,stroke:#333,color:#fff
style B fill:#9b59b6,stroke:#333,color:#fff
style D fill:#f39c12,stroke:#333,color:#fff
style E fill:#2c3e50,stroke:#333,color:#fff
style F fill:#27ae60,stroke:#333,color:#fff
style G fill:#e74c3c,stroke:#333,color:#fff
```
Who Built It?
SWE-bench was originally created at Princeton NLP by:
- Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
The Verified subset was produced in collaboration with OpenAI’s Preparedness team:
- Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, and others, with senior leads Tejal Patwardhan, Kevin Liu, and Aleksander Mądry
Publication & Timeline
| Date | Milestone |
|---|---|
| October 2023 | SWE-bench paper published (arXiv:2310.06770) |
| January 2024 | Accepted as ICLR 2024 Oral |
| June 2024 | Dockerized evaluation harness released |
| August 2024 | SWE-bench Verified released — 500 human-validated tasks |
| October 2024 | SWE-bench Multimodal released (ICLR 2025) |
| July 2025 | mini-SWE-agent scores 65% on Verified in 100 lines of Python |
What Skills Does It Test?
SWE-bench Verified evaluates the end-to-end software engineering capabilities of AI systems — from understanding a bug report to producing a working fix.
```mermaid
graph TD
A["SWE-bench Verified<br/>500 Real-World Tasks"] --> B["Code Understanding<br/>Navigate large codebases<br/>across multiple files"]
A --> C["Bug Diagnosis<br/>Interpret issue descriptions<br/>and reproduce bugs"]
A --> D["Patch Generation<br/>Edit code to fix issues<br/>without breaking anything"]
A --> E["Multi-File Reasoning<br/>Coordinate changes across<br/>functions, classes, files"]
A --> F["Testing Awareness<br/>Produce fixes that pass<br/>hidden unit tests"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#3498db,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#f39c12,stroke:#333,color:#fff
style E fill:#9b59b6,stroke:#333,color:#fff
style F fill:#2c3e50,stroke:#333,color:#fff
```
| Capability | What Is Tested |
|---|---|
| Codebase Navigation | Finding relevant files and functions in repositories with thousands of files |
| Issue Comprehension | Understanding natural-language bug reports, feature requests, and ambiguous problem descriptions |
| Code Generation | Writing correct patches that resolve issues — not just generating new code from scratch |
| Regression Safety | Ensuring fixes don’t break existing functionality (PASS_TO_PASS tests) |
| Tool Use | Interacting with execution environments, running commands, reading outputs |
| Long-Context Reasoning | Processing extremely long contexts spanning multiple files and directories |
The 12 Python Repositories
Tasks are drawn from real issues in: Django, scikit-learn, sympy, matplotlib, Flask, requests, pytest, astropy, sphinx, xarray, pylint, and seaborn.
Human-Verified Quality
The annotation campaign assessed each task on three dimensions:
- Problem Statement Clarity (scale 0–3): Is the issue well-specified? Can the agent understand what to fix?
- Test Validity (scale 0–3): Do the FAIL_TO_PASS tests fairly evaluate solutions? Or do they reject valid fixes?
- Difficulty (estimated developer time): <15 min, 15 min–1 hr, 1–4 hr, >4 hr
Any sample that any one of its three annotators flagged at severity 2 or above was removed — a conservative approach yielding high confidence in the remaining 500 tasks.
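The filtering rule above can be expressed as a short predicate (a sketch only; the field names `problem_clarity` and `test_validity` are illustrative, not the annotation campaign's actual schema):

```python
def keep_sample(annotations):
    """Conservative filter sketch: drop a sample if ANY annotator rates
    either dimension at severity 2 or above on the 0-3 scale.
    `annotations` is one dict of ratings per annotator (field names assumed)."""
    return all(
        a["problem_clarity"] < 2 and a["test_validity"] < 2
        for a in annotations
    )

# Three annotators; a single severity-2 flag on test validity removes the sample:
print(keep_sample([
    {"problem_clarity": 0, "test_validity": 0},
    {"problem_clarity": 1, "test_validity": 2},
    {"problem_clarity": 0, "test_validity": 1},
]))  # False
```

Requiring unanimity to keep a sample (rather than a majority vote to drop it) is what makes the filter conservative.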
Dashboard — SWE-bench Verified Leaderboard
Bash-Only Model Evaluation (mini-SWE-agent)
To enable fair apples-to-apples comparison of language models, all models below are evaluated using mini-SWE-agent — a minimal 100-line bash-only agent loop with no special tools, RAG, or scaffolding. These results reflect raw LM capability when given just a bash shell and a problem.
| Rank | Model | % Resolved | Cost per Instance |
|---|---|---|---|
| 1 | Claude 4.5 Opus (high reasoning) | 76.80 | $0.75 |
| 2 | Gemini 3 Flash (high reasoning) | 75.80 | $0.36 |
| 2 | MiniMax M2.5 (high reasoning) | 75.80 | $0.07 |
| 4 | Claude Opus 4.6 | 75.60 | $0.55 |
| 5 | GPT-5.2 Codex | 72.80 | $0.45 |
| 5 | GLM-5 (high reasoning) | 72.80 | $0.53 |
| 5 | GPT-5.2 (high reasoning) | 72.80 | $0.47 |
| 8 | Claude 4.5 Sonnet (high reasoning) | 71.40 | $0.66 |
| 9 | Kimi K2.5 (high reasoning) | 70.80 | $0.15 |
| 10 | DeepSeek V3.2 (high reasoning) | 70.00 | $0.45 |
| 11 | Gemini 3 Pro | 69.60 | $0.96 |
| 12 | Claude 4.5 Haiku (high reasoning) | 66.60 | $0.33 |
| 13 | GPT-5 Mini | 56.20 | $0.05 |
Source: swebench.com — Verified leaderboard, mini-SWE-agent evaluation (consulted March 29, 2026)
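The bash-only loop described above can be sketched in a few lines. This is a conceptual sketch in the spirit of mini-SWE-agent, not its actual source: the model sees the issue plus the transcript of prior commands, replies with one shell command per turn, and a `submit` reply (an assumed stop signal here) ends the episode.

```python
import subprocess

def agent_loop(llm, problem_statement, max_steps=50):
    """Minimal bash-only agent loop sketch. `llm` is any callable mapping a
    prompt string to the model's next shell command."""
    history = [f"Issue:\n{problem_statement}"]
    for _ in range(max_steps):
        command = llm("\n".join(history))   # model proposes one bash command
        if command.strip() == "submit":     # assumed stop signal
            break
        result = subprocess.run(            # execute it and capture the output
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        history.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return history
```

The appeal of this setup for benchmarking is that there is almost nothing to tune: with no retrieval, no custom tools, and no multi-agent scaffolding, score differences are attributable to the model rather than the harness.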
Key Insights from the Results
```mermaid
graph TD
A["SWE-bench Verified<br/>Leaderboard Insights"] --> B["Top Tier: ~75%<br/>Claude 4.5 Opus<br/>Gemini 3 Flash<br/>MiniMax M2.5"]
A --> C["Cost Efficiency<br/>MiniMax M2.5: 75.8%<br/>at only $0.07/task"]
A --> D["Gap to 100%<br/>~25% of tasks<br/>remain unsolved"]
A --> E["Model Size Matters<br/>GPT-5 Mini: 56.2%<br/>vs GPT-5.2: 72.8%"]
style A fill:#2c3e50,stroke:#333,color:#fff
style B fill:#27ae60,stroke:#333,color:#fff
style C fill:#3498db,stroke:#333,color:#fff
style D fill:#e74c3c,stroke:#333,color:#fff
style E fill:#f39c12,stroke:#333,color:#fff
```
- The top tier reaches ~76% resolved — remarkable progress from Claude 2’s original 1.96% on SWE-bench in 2023
- Cost varies dramatically — MiniMax M2.5 achieves 75.8% at just $0.07/task, while Gemini 3 Pro costs $0.96 for 69.6%
- ~25% of tasks remain unsolved even by the best models — the hardest real-world bugs still defeat frontier AI
- Reasoning modes help significantly — models with “high reasoning” consistently outperform their default configurations
- Smaller models lag behind — GPT-5 Mini at 56.2% vs GPT-5.2 at 72.8% shows the importance of model scale for complex SWE tasks
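One way to make the cost comparison concrete is to normalize by success rate: dividing cost per instance by the resolve rate gives cost per *resolved* task. The figures below are taken from the leaderboard table above:

```python
# Cost per resolved task = cost per instance / resolve rate
# (cost, rate) pairs from the leaderboard table above.
models = {
    "MiniMax M2.5": (0.07, 0.758),
    "Gemini 3 Pro": (0.96, 0.696),
    "GPT-5 Mini":   (0.05, 0.562),
}
for name, (cost, rate) in models.items():
    print(f"{name}: ${cost / rate:.3f} per resolved task")
```

By this metric the gap widens further: a cheap model with a high resolve rate dominates on both axes, while a model that is expensive per attempt and fails more often pays twice.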
Historical Context — The Rapid Rise
| Date | Best Score | Model/Agent |
|---|---|---|
| October 2023 | 1.96% | Claude 2 (original SWE-bench paper) |
| March 2024 | 12.47% | SWE-agent |
| August 2024 | 33.2% | GPT-4o + Agentless (on Verified) |
| July 2025 | 65.0% | mini-SWE-agent |
| February 2026 | 76.8% | Claude 4.5 Opus (high reasoning) |
The improvement from 1.96% → 76.8% in just over two years represents one of the fastest capability gains in any AI benchmark.
From SWE-bench to SWE-bench Verified — Why the Upgrade?
The original SWE-bench systematically underestimated model capabilities because:
```mermaid
graph LR
A["Problem 1<br/>Underspecified issues<br/>38.3% flagged"] --> D["SWE-bench Verified<br/>Filters out all<br/>problematic samples"]
B["Problem 2<br/>Unfair unit tests<br/>61.1% flagged"] --> D
C["Problem 3<br/>Setup failures<br/>causing false negatives"] --> D
D --> E["500 high-quality<br/>validated tasks"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#e74c3c,stroke:#333,color:#fff
style C fill:#e74c3c,stroke:#333,color:#fff
style D fill:#27ae60,stroke:#333,color:#fff
style E fill:#3498db,stroke:#333,color:#fff
```
- 38.3% of samples had underspecified problem statements — ambiguous issues that even human developers would struggle with
- 61.1% had unit tests that could unfairly reject valid solutions — e.g., requiring exact string matches on deprecation messages not mentioned in the issue
- Overall, 68.3% of original samples were filtered out during the verification process
This doesn’t make SWE-bench Verified “easier” — it makes it fairer. Performance increases within individual difficulty categories confirm that the improvement comes from removing impossible tasks, not just easy ones.
Where to Explore
| Resource | Link |
|---|---|
| Verified Leaderboard | swebench.com/verified.html |
| Full Leaderboard (all agents) | swebench.com |
| Results Viewer | swebench.com/viewer.html |
| HuggingFace Dataset | huggingface.co/datasets/princeton-nlp/SWE-bench_Verified |
| GitHub Repository | github.com/SWE-bench/SWE-bench |
| arXiv Paper | arxiv.org/abs/2310.06770 |
| OpenAI Blog (Verified) | openai.com/index/introducing-swe-bench-verified |
| mini-SWE-agent | github.com/SWE-agent/mini-swe-agent |
| License | MIT License |
Load the Dataset
```python
from datasets import load_dataset

# Load the 500-task Verified test split from the HuggingFace Hub
swebench_verified = load_dataset(
    "princeton-nlp/SWE-bench_Verified", split="test"
)
```

Watch the Video
Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀
References
- Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR 2024. arXiv:2310.06770.
- OpenAI Preparedness. (2024). Introducing SWE-bench Verified. openai.com/index/introducing-swe-bench-verified.
- Yang, J., Jimenez, C.E., Zhang, A.L., et al. (2025). SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? ICLR 2025. arXiv:2410.03859.
- SWE-bench Team. SWE-bench Official Leaderboard. swebench.com.
Read More
- LiveCodeBench Pro — competitive programming benchmark with contamination-free evaluation
- SimpleQA — measuring short-form factuality and hallucination in LLMs
- Humanity’s Last Exam — the hardest AI benchmark across 100+ academic disciplines
- GPQA Diamond — graduate-level science questions for expert reasoning
- ARC-AGI-2 — abstract reasoning that challenges pattern recognition beyond training data